How ‘dark LLMs’ produce harmful outputs, despite guardrails

And it’s not hard to do, they noted. “The ease with which these LLMs can be manipulated to produce harmful content underscores the urgent need for robust safeguards. The risk is not speculative — it is immediate, tangible, and deeply concerning, highlighting the fragile state of AI safety in the face of rapidly evolving jailbreak techniques.”
Justin St-Maurice, technical counselor at Info-Tech Research Group, agreed. “This paper adds more evidence to what many of us already understand: LLMs aren’t secure systems in any deterministic sense,” he said. “They’re probabilistic pattern-matchers trained to predict text that sounds right, not rule-bound engines with an enforceable logic. Jailbreaks are not just likely, but inevitable. In fact, you’re not ‘breaking into’ anything… you’re just nudging the model into a new context it doesn’t recognize as dangerous.”
The paper pointed out that open-source LLMs are a particular concern, since they can’t be patched once in the wild. “Once an uncensored version is shared online, it is archived, copied, and distributed beyond control,” the authors noted, adding that once a model is saved on a laptop or local server, it is out of reach. In addition, the researchers found that the risk is compounded because attackers can use one model to generate jailbreak prompts for another.